R at scale on the Google Cloud Platform

Mark Edmondson (@HoloMarkeD)

May 20th, 2019 - CopenhagenR

code.markedmondson.me

Credentials

My R Timeline

  • Digital agencies since 2007
  • useR since 2012 - Motive: how to use all this web data?
  • Shiny enthusiast e.g. https://gallery.shinyapps.io/ga-effect/
  • Google Developer Expert - Google Analytics & Google Cloud
  • Several Google API themed packages on CRAN via googleAuthR
  • Part of cloudyr group (AWS/Azure/GCP R packages for the cloud) https://cloudyr.github.io/
  • Now: Data Engineer @ IIH Nordic

GA Effect

googleAuthRverse

  • searchConsoleR
  • googleAuthR
  • googleAnalyticsR
  • googleComputeEngineR (Cloudyr)
  • bigQueryR (Cloudyr)
  • googleCloudStorageR (Cloudyr)
  • googleLanguageR (rOpenSci)

Slack group to talk about the packages: #googleAuthRverse

Scale (almost) always starts with Docker containers

Dockerfiles from The Rocker Project

https://www.rocker-project.org/

Maintain useful R images

  • rocker/r-ver
  • rocker/rstudio
  • rocker/tidyverse
  • rocker/shiny
  • rocker/ml-gpu

Thanks to Rocker Team

Dockerfiles

FROM rocker/tidyverse:3.6.0
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
    libssl-dev

## Install packages from CRAN
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    googleComputeEngineR \
    googleAnalyticsR \
    searchConsoleR \
    googleCloudStorageR \
    bigQueryR \
    ## install Github packages
    && installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
    ## clean up
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

Docker + R = R in Production

  • Flexible: no need to ask IT to install R in various places, just run docker run. Cross-cloud, ascendant tech.

  • Version controlled: no worries that new package releases will break your code.

  • Scalable: run multiple Docker containers at once; fits into the event-driven, stateless, serverless future.

Creating Docker images with Cloud Build

Continuous development with GitHub pushes

Scaling R scripts, Shiny apps and APIs

Strategies to scale R

  • Vertical scaling - increase the size and power of one machine
  • Horizontal scaling - split up your problem into lots of little machines
  • Serverless scaling - send your code + data into the cloud and let the provider sort out how many machines to use

Vertical scaling

Bigger boat

Bigger VMs

Pros

  • Probably run the same code with no changes needed
  • Easy to set up

Cons

  • Expensive
  • May be better to have the data in a database

Launching a monster VM in the cloud

3.75TB of RAM: $423 a day

library(googleComputeEngineR)

# this will cost a lot
bigmem <- gce_vm("big-mem", 
                 template = "rstudio", 
                 predefined_type = "n1-ultramem-160")
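
To put the daily figure above into perspective for shorter sessions, here is a minimal sketch that converts the $423-a-day price into a cost for a given number of hours. The helper name vm_cost and the assumption of billing that is linear in uptime are illustrative only; real prices vary by region and change over time.

```r
# Hypothetical helper: estimate the cost of keeping a VM running for `hours`,
# assuming billing is linear in uptime (an assumption for illustration).
vm_cost <- function(hours, daily_rate = 423) {
  hourly_rate <- daily_rate / 24
  hours * hourly_rate
}

vm_cost(1)   # cost of one hour
vm_cost(8)   # cost of a working day
```

Since you only pay for uptime, the cheapest monster VM is the one that is stopped as soon as the job has finished.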

RStudio Server

Standard VM serving Shiny

library(googleComputeEngineR)

custom_image <- gce_tag_container("custom-shiny-app", 
                                  "your-project")
## make new Shiny template VM for your self-contained Shiny app
vm <- gce_vm("myapp", 
             template = "shiny",
             predefined_type = "n1-standard-2",
             dynamic_image = custom_image)

Cloud computing considerations

  • You are only charged for uptime and can configure lots of VMs, so…
  • Have lots of specialised VMs (Docker images), not one big workstation
  • Keep code and data separate, e.g. with googleCloudStorageR or bigQueryR
  • Think of VMs as functions of computing power

Horizontal scaling

Lots of little machines can accomplish great things

Parallelise your code

Pros

  • Fault redundancy
  • Forces repeatable/reproducible infrastructure
  • library(future) makes parallel processing very usable

Cons

  • Changes to your code for split-map-reduce
  • Need to write meta code to handle I/O of data and code
  • Not applicable to some problems

Adopt a split-map-reduce mindset

  • Break problems down into stateless lumps
  • Reusable bricks that can be applied to other tasks
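
As a minimal sketch of that mindset, the example below uses only base R and the built-in mtcars data. The function name summarise_lump is hypothetical, and grouping by cylinder count is just a convenient way to create independent lumps.

```r
# Split: break the data into independent, stateless lumps (one per cylinder count)
lumps <- split(mtcars, mtcars$cyl)

# Map: a reusable brick whose result depends only on its input
summarise_lump <- function(df) {
  data.frame(cyl = df$cyl[1], n = nrow(df), mean_mpg = mean(df$mpg))
}
mapped <- lapply(lumps, summarise_lump)

# Reduce: combine the partial results into one answer
result <- do.call(rbind, mapped)
print(result)
```

Because summarise_lump() keeps no state, the lapply() call is the only line that would need to change when the work moves to a cluster.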

Setup a cluster

New in googleComputeEngineR v0.3 - a shortcut that launches a cluster and checks authentication for you

library(googleComputeEngineR)

vms <- gce_vm_cluster()
#2019-03-29 23:24:54> # Creating cluster with these arguments:template = r-base,dynamic_image = rocker/r-parallel,wait = FALSE,predefined_type = n1-standard-1
#2019-03-29 23:25:10> Operation running...
#...
#2019-03-29 23:25:25> r-cluster-1 VM running
#2019-03-29 23:25:27> r-cluster-2 VM running
#2019-03-29 23:25:29> r-cluster-3 VM running
#...
#2019-03-29 23:25:53> # Testing cluster:
#r-cluster-1 ssh working
#r-cluster-2 ssh working
#r-cluster-3 ssh working

library(future)

googleComputeEngineR has a custom method for future::as.cluster

## make a future cluster
library(future)
library(googleComputeEngineR)

vms <- gce_vm_cluster()
plan(cluster, workers = as.cluster(vms))

...do parallel...
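
To try the pattern without launching any VMs, a local cluster from the base parallel package can stand in for the cloud one. This is only a sketch: the two local worker processes are an assumption for illustration, and it uses parLapply() rather than future. With future installed, the same cl object could be handed to plan(cluster, workers = cl) instead.

```r
library(parallel)

# Local stand-in for the VM cluster: two R worker processes on this machine
cl <- makePSOCKcluster(2)

# The workers share the inputs between them and each squares its portion
squares <- parLapply(cl, 1:8, function(x) x^2)

# Always release the workers when done
stopCluster(cl)

unlist(squares)
```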

Forecasting example

library(future)
library(future.apply)   # provides future_lapply()
library(googleComputeEngineR)
library(forecast)

# create cluster
vms <- gce_vm_cluster("r-vm", cluster_size = 3)
plan(cluster, workers = as.cluster(vms))

# get data (full.names = TRUE so read.csv() receives complete paths)
my_files <- list.files("myfolder", full.names = TRUE)
my_data <- lapply(my_files, read.csv)

# forecast data in cluster
cluster_f <- function(my_data, args = 4){
   forecast(auto.arima(ts(my_data, frequency = args)))
}

result <- future_lapply(my_data, cluster_f, args = 4)
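
The shape of this workflow can be rehearsed locally before paying for a cluster. The sketch below is a stand-in under stated assumptions: two built-in time series replace the CSV files, a fixed ARIMA(1,1,0) from base R replaces auto.arima(), and a plain lapply() replaces future_lapply(). The function name forecast_one is hypothetical.

```r
# Two built-in monthly series stand in for the CSV files
my_data <- list(air = AirPassengers, deaths = UKDriverDeaths)

# Swapped-in technique: a fixed ARIMA(1,1,0) from base R, not auto.arima()
forecast_one <- function(x, horizon = 12) {
  fit <- arima(x, order = c(1, 1, 0))
  predict(fit, n.ahead = horizon)$pred
}

# Sequential map; on a cluster the same call would be distributed
result <- lapply(my_data, forecast_one, horizon = 12)
lengths(result)
```

Each list element is forecast independently of the others, which is what makes the job a good fit for horizontal scaling.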

Multi-layer future loops

You can nest future loops to use each CPU within each VM

Thanks to Grant McDermott for figuring out the optimal method (Issue #129)

future_sim <- 
  ## Outer future_lapply() call loops over the no. of VMS
  future_lapply(1:length(vms), FUN = function(x) { 
  
    ## Inner future_lapply() call loops over desired no. of iterations / no. of VMs
    future_lapply(1:(iters/length(vms)), FUN = slow_func) 
    
  })
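
One detail worth checking in the nested loop is how iterations are shared out: 1:(iters/length(vms)) drops the remainder when iters is not a multiple of the number of VMs. The base R sketch below shows one way to assign every iteration to exactly one VM. The values of n_vms and iters and the body of slow_func are placeholders, and plain lapply() stands in for future_lapply().

```r
n_vms <- 3    # stand-in for length(vms)
iters <- 10   # deliberately not a multiple of n_vms

# Placeholder for the real slow_func()
slow_func <- function(i) i^2

# Assign every iteration to exactly one VM, round-robin
chunks <- split(seq_len(iters), rep_len(seq_len(n_vms), iters))

# Outer loop over VMs, inner loop over that VM's iterations;
# both lapply() calls mark where future_lapply() would go
nested <- lapply(chunks, function(idx) lapply(idx, slow_func))

lengths(nested)
```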

CPU utilization

3 VMs, 8 CPUs each = 24 threads

Serverless scaling

We spoke previously of

Clusters of VMs + Docker = Horizontal scaling

Kubernetes

Clusters of VMs + Docker + Task controller = Kubernetes

Kubernetes

Pros

  • Auto-scaling, task queues etc.
  • Scale to billions
  • Potentially cheaper
  • May already have a cluster in your organisation

Cons

  • Needs stateless, idempotent workflows
  • Message broker?
  • Minimum 3 VMs
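
To illustrate what a stateless, idempotent workflow can look like in R, here is a minimal sketch. The function run_task and the file naming scheme are hypothetical, and a local temporary directory stands in for shared storage such as a Cloud Storage bucket.

```r
# Hypothetical task: the output location is derived only from the task id,
# so re-running the same task cannot create duplicate results.
run_task <- function(task_id, out_dir) {
  out_file <- file.path(out_dir, paste0("result-", task_id, ".rds"))
  if (file.exists(out_file)) {
    return("skipped")   # work already done, safe to retry
  }
  result <- task_id^2   # placeholder for the real computation
  saveRDS(result, out_file)
  "done"
}

out_dir <- tempfile("tasks-")
dir.create(out_dir)

first  <- run_task(7, out_dir)
second <- run_task(7, out_dir)   # simulated retry after a container restart
c(first, second)
```

A scheduler can then restart or duplicate a container at any point without corrupting the results, because the task carries no state between runs.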

Dockerfiles for Shiny apps

Built on Cloud Build upon GitHub push:

FROM rocker/shiny
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
    libssl-dev

## Install packages from CRAN needed for your app
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    googleAnalyticsR

## assume shiny app is in build folder /shiny
COPY ./shiny/ /srv/shiny-server/myapp/

Kubernetes deployments - Shiny

Shiny App:

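# recent kubectl versions: `kubectl run` creates a bare Pod,
# so create the Deployment explicitly before exposing it
kubectl create deployment shiny1 \
  --image=gcr.io/gcer-public/shiny-googleauthrdemo:latest \
  --port=3838

kubectl expose deployment shiny1 \
  --target-port=3838  --type=NodePort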

Dockerfiles for plumber APIs

Built on Cloud Build on every GitHub push:

FROM trestletech/plumber

# copy your plumbed R script     
COPY api.R /api.R

# default is to run the plumbed script
CMD ["api.R"]
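
For context, a plumbed script is ordinary R code with #* annotations above each endpoint function. The sketch below shows what an api.R could contain. The /echo route and the helper name echo_message are assumptions chosen to resemble the echo example used in this talk, not the actual file behind it.

```r
# Plain R function holding the logic, so it can be tested without plumber
echo_message <- function(msg = "") {
  paste("The message is:", msg)
}

#* Echo back the supplied message
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  echo_message(msg)
}
```

Keeping the logic in a named function separate from the annotated endpoint means it can be unit tested locally before the image is built.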

Kubernetes deployments - Plumber

R plumber API:

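# recent kubectl versions: `kubectl run` creates a bare Pod,
# so create the Deployment explicitly before exposing it
kubectl create deployment my-plumber \
  --image=gcr.io/your-project/my-plumber \
  --port=8000

kubectl expose deployment my-plumber \
  --target-port=8000  --type=NodePort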

Shiny apps waiting for service

Expose your workloads via Ingress

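# extensions/v1beta1 Ingress has been removed from current Kubernetes;
# this is the same rule in networking.k8s.io/v1
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: r-ingress-nginx
spec:
  rules:
  - http:
      paths:
      # app deployed to /gar/shiny/
      - path: /gar/
        pathType: Prefix
        backend:
          service:
            name: shiny1
            port:
              number: 3838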

Apps available at URL on demand

curl -G 'http://mydomain.com/api/echo' --data-urlencode 'msg=its alive!'
#> "The message is: its alive!"

I thought I knew a bit about R and Google Cloud but then…

GoogleNext19 - Data Science at Scale with R on GCP

A 40-minute talk at Google Next19 with lots of new things to try!

https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be

New concepts

A great video that goes further into Spark clusters, Jupyter notebooks, training with ML Engine, and scaling with Seldon on Kubernetes, which I haven’t tried yet

Some shots from the video

Google Cloud Platform - Serverless Pyramid

Google Cloud Platform - R applications

Conclusions

Take-aways

  • Anything scales on Google Cloud Platform, including R
  • Docker docker docker
  • library(future)
  • Pick the scaling strategy most suitable for you

Gratitude

  • Thank you for listening
  • Thanks to Kenneth for inviting me
  • Thanks to RStudio for all their cool things. Support them by buying their stuff.
  • Thanks again to Rocker
  • Thanks to Google for the Developer Expert programme and for building cool stuff.

Say hello afterwards